Perform hierarchical clustering with single, complete and average linkage using the iris data. You could also look at other methods, such as Ward’s method (see ?hclust for details).
In each case, cut the dendrogram to give three distinct groups, and compute the confusion matrix comparing the clusters found with the Species label. Comment on which linkage method has worked best in this case.
The dendrograms show that Ward’s method most clearly finds three distinct clusters. Three clusters are also clear(ish) with complete linkage, but the single and average linkage methods do not really suggest there are three natural groups: two clusters look more natural in both dendrograms (setosa versus the other two species).
The confusion matrices show that when cut into 3 clusters, Ward’s method and average linkage do a good job of finding the three groups of species. The single linkage method does a poor job. Complete linkage does not separate versicolor from virginica very well.
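The comparison above can be reproduced along the following lines. This is a minimal sketch, assuming Euclidean distance on the raw (unscaled) measurements and taking "ward.D2" as the Ward variant; neither choice is stated in the text, and the object names are illustrative.

```r
# Assumption: raw (unscaled) measurements with Euclidean distance.
X <- iris[, 1:4]
D <- dist(X)

# "ward.D2" is one of the two Ward variants offered by hclust (see ?hclust).
methods <- c("single", "complete", "average", "ward.D2")

# Fit one tree per linkage method.
fits <- lapply(methods, function(m) hclust(D, method = m))
names(fits) <- methods

# Cut each tree into three groups and cross-tabulate against Species.
confusion <- lapply(fits, function(h) {
  table(cluster = cutree(h, k = 3), species = iris$Species)
})

print(confusion)
```

Each dendrogram can then be inspected with, for example, plot(fits$complete). The cluster numbers in each table are arbitrary, so a good method is one where each species column is concentrated in a single row.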
We do not normally know the species/cluster labels when carrying out cluster analysis. Can we still say anything about which method is better if we were expecting to see three distinct groups?
Solution
It would depend on what feature you are expecting to see. If we were told that there should be three groups, then complete linkage and Ward’s method both give three clearly distinct groups whereas single linkage only gives two clear clusters. So, complete linkage and Ward’s method would appear to be better if the purpose is to form three distinct clusters.
Compare the hierarchical clustering methods with the results of doing K-means clustering and model-based clustering (assuming multivariate normal distributions for each population).
We can see that K-means is pretty good at finding the three natural clusters (note that the labels 1, 2, 3 don’t mean anything), with performance similar to hierarchical clustering with Ward’s method.
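A K-means fit of this kind might look like the sketch below. The seed and the number of random starts are arbitrary choices made here for reproducibility, not values taken from the original analysis.

```r
# K-means uses random starting centres, so fix a seed (value is arbitrary).
set.seed(1)

X <- iris[, 1:4]

# Three clusters; nstart = 25 reruns the algorithm and keeps the best fit.
km <- kmeans(X, centers = 3, nstart = 25)

# Cross-tabulate the cluster labels against the known species.
km_tab <- table(cluster = km$cluster, species = iris$Species)
print(km_tab)
```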
In terms of finding the natural clusters, model-based clustering has worked spectacularly well.
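One way to fit a mixture of multivariate normals is the mclust package. The text does not say which package was used, so treat this as an assumption; the sketch skips the fit if mclust is not installed.

```r
# Assumption: model-based clustering via the mclust package (not named in the text).
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)  # Mclust expects the package to be attached

  # Fit a three-component Gaussian mixture; the covariance structure
  # is chosen by BIC among the models mclust considers.
  mb_fit <- Mclust(iris[, 1:4], G = 3)

  # Cross-tabulate the fitted classification against the known species.
  mb_tab <- table(cluster = mb_fit$classification, species = iris$Species)
  print(mb_tab)
} else {
  message("mclust is not installed; skipping the model-based fit")
}
```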
Task 2
Download the Indian Premier League data from Moodle and load it into R. We will filter the data to only look at players who played at least 10 innings, and then select just the information on the number of runs they scored, their high score (HS), their batting average (Avg), their best figures (BF), their strike rate (SR), and the number of 4s and 6s they hit.
Apply agglomerative hierarchical clustering to these data. You will need to first compute a distance matrix for the data, which can be done with the dist command.
Using the Euclidean distance, do single linkage, complete linkage, and average linkage give similar dendrograms?
The complete and average linkage dendrograms are very close, but single linkage is completely different. To see that the first two dendrograms are similar, let’s use both of them to create 4 clusters, and then compare the cluster assignments.
Remember that the cluster labels are arbitrary. In this case we can see that the 4 clusters found are remarkably similar, with only a single difference in cluster assignment.
Do your results change much if we use a different distance measure (e.g. Manhattan)?
We can see that using the Manhattan distance has not made a big difference to the dendrograms, which can be checked by looking at the cluster assignments.
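The cluster-assignment checks used in the last two answers can be sketched as follows. The helper name cut_hclust is hypothetical, and the data here are a simulated placeholder so that the block runs on its own; with the real data loaded, X would be the seven numeric columns, IPL10[, 2:8].

```r
# Hypothetical helper: cluster X and cut the tree into k groups.
cut_hclust <- function(X, k, linkage = "complete", distance = "euclidean") {
  cutree(hclust(dist(X, method = distance), method = linkage), k = k)
}

# Placeholder data only (NOT the IPL data): 40 rows, 7 numeric columns.
set.seed(1)
X <- matrix(rnorm(40 * 7), nrow = 40)

# Complete versus average linkage, both with Euclidean distance.
linkage_tab <- table(complete = cut_hclust(X, 4, linkage = "complete"),
                     average  = cut_hclust(X, 4, linkage = "average"))

# Euclidean versus Manhattan distance, both with complete linkage.
distance_tab <- table(euclidean = cut_hclust(X, 4, distance = "euclidean"),
                      manhattan = cut_hclust(X, 4, distance = "manhattan"))

print(linkage_tab)
print(distance_tab)
```

Two clusterings agree (up to relabelling) when each row of the table has essentially all of its counts in a single column.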
There are several R packages for creating different types of dendrogram plots. Have a look at the link here and try creating an alternative type of dendrogram. Consider whether this has helped communicate anything of interest about the data.
Solution
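library("ape")
# Use the player names as row labels so they appear on the plot
tmp <- IPL10[, 2:8]
rownames(tmp) <- IPL10$PLAYER
# Complete linkage clustering on Euclidean distances
D <- dist(tmp, method = "euclidean")
IPL.comp <- hclust(D, method = "complete")
# Convert to a phylo object and draw it as a cladogram
plot(as.phylo(IPL.comp), type = "cladogram", cex = 0.6, label.offset = 0.5)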
Personally, I don’t find these variations useful, but it is a matter of personal taste. If I knew more about the IPL, I might be able to explain some of the clusters that appear here (e.g., aggressive attacking batsmen, slow scorers, bowlers, etc.).
Task 3
Look at the data stored in the USArrests data frame in R. You can read about the data by typing ?USArrests.
Apply a selection of clustering methods to these data and discuss how many clusters appear to be present.
Solution
Let’s just look at a complete linkage dendrogram, and k-means clustering.
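A minimal sketch of both analyses is given below. It assumes the variables are standardised first, since Assault is recorded on a much larger scale than the other columns; the text does not say whether this was done. The range of k values and the seed are likewise illustrative choices.

```r
# Assumption: standardise the columns so no single variable dominates the distances.
X <- scale(USArrests)

# Complete linkage dendrogram.
hc <- hclust(dist(X), method = "complete")
plot(hc, cex = 0.6, main = "USArrests, complete linkage")

# K-means for a range of k, recording the total within-cluster sum of squares.
set.seed(1)  # arbitrary seed for reproducibility
ks <- 2:6
wss <- sapply(ks, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

print(data.frame(k = ks, wss = round(wss, 1)))
```

Long vertical branches in the dendrogram, or a value of k after which the within-cluster sum of squares stops dropping sharply, are informal guides to how many clusters are present.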